Future-proofing CSV Column-creep with the `janitor` Package

Matt Pettis
2018-10-18

Who I Am

  • Principal Data Scientist, Honeywell
  • R user, since about 2010, according to my internet questions

Check these features

Baby Bear

... Now these features

Momma Bear

Moar features !!1!

Cthulhu

You'll probably keep getting more columns, tbh

itcrowd

  • It's a problem
  • … because you hand-changed the column names.
  • … and there's got to be a better way…

Say...

itcrowd

… there's an easy, programmatic way to do this?

To the notebook!
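
A minimal sketch of the idea, not the notebook from the talk: let `janitor::clean_names()` rename the columns and keep a record of what each one used to be called. The data frame and its column names are made up for illustration.

```r
library(janitor)

# Hypothetical messy header, as it might arrive in a hand-built export.
raw <- data.frame(
  `Customer ID`  = c(1, 2),
  `Order Date`   = c("2018-10-01", "2018-10-02"),
  `Total Amount` = c(10.5, 20),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Remember the original names before cleaning them.
old_names <- names(raw)

# clean_names() rewrites the column names into a consistent style
# (snake_case by default) and keeps the column order.
cleaned <- clean_names(raw)

# Record the new -> old mapping so nothing is lost.
name_map <- data.frame(
  new = names(cleaned),
  old = old_names,
  stringsAsFactors = FALSE
)

print(name_map)
```

When another column shows up next month, nothing here needs to change: the new column gets cleaned and recorded along with the rest.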

Make a CSV file-loading workflow

  • Don't copy and paste around the logic for loading a CSV file.
    • Define it once and wrap in a function.
    • Keep the loading functions in a library.
    • Source that library in the scripts where you want to use the data.
    • Give the option to return just the data, or the data plus the new -> old column name mappings.
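
The workflow above could be sketched roughly like this. The function name `load_csv_clean`, its `with_mapping` argument, and the demo file are all assumptions for illustration, not the code from the talk's repo.

```r
library(janitor)

# Hypothetical loader: define the CSV-reading logic once, in one place.
# with_mapping = FALSE returns just the cleaned data frame;
# with_mapping = TRUE returns a list holding the data and the
# new -> old column name mapping.
load_csv_clean <- function(path, with_mapping = FALSE, ...) {
  # check.names = FALSE keeps the header exactly as it appears in the file.
  raw <- utils::read.csv(path, check.names = FALSE,
                         stringsAsFactors = FALSE, ...)
  old_names <- names(raw)
  dat <- janitor::clean_names(raw)

  if (!with_mapping) {
    return(dat)
  }

  list(
    data = dat,
    name_map = data.frame(new = names(dat), old = old_names,
                          stringsAsFactors = FALSE)
  )
}

# Demo with a throwaway file so the sketch runs on its own.
demo_path <- tempfile(fileext = ".csv")
writeLines(c('"Customer ID","Order Date","Total Amount"',
             '1,"2018-10-01",10.5',
             '2,"2018-10-02",20'),
           demo_path)

just_data <- load_csv_clean(demo_path)
both      <- load_csv_clean(demo_path, with_mapping = TRUE)
print(both$name_map)
```

In a real project the function definition would live in its own file (say, a hypothetical `R/load_functions.R`), and each analysis script would pull it in with `source("R/load_functions.R")` rather than carrying its own copy.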

To sum up...

  • Clean your column names programmatically with janitor.
  • Always record your old and new column names.
  • Make your CSV load definitions once and wrap them in a function.
  • Keep your load definition functions in a source-able library file.

Thank You

This repo: https://github.com/mpettis/r-janitor-talk